
Conversation

@jonastheis
Contributor

@jonastheis jonastheis commented Oct 30, 2025

This PR implements a self-healing gap recovery mechanism for L1 messages and batch events. The actual gap detection happens in the ChainOrchestrator, which then notifies the L1Watcher that it needs to reset.

Specifically, the following changes are implemented:

  • detect gaps and duplicates in L1 messages and batch commit events
  • handle detected gaps in the ChainOrchestrator and reset the L1Watcher
  • implement a command receiver in the L1Watcher so it can be reset to a given sync height (a rough sketch of this flow follows the list)
    • making sure that there is no deadlock, since the L1Watcher blocks when its send channel is full
  • handle the edge case of a missing L1 message for a commit batch: such batches are retried in the derivation pipeline, so once the missing L1 message is recovered the batch can be processed
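
Below is a rough sketch of the reset flow. It is illustrative only: the L1WatcherCommand enum, field names, and channel capacities are assumptions rather than the PR's actual code. The key idea is that the reset command carries a fresh notification sender and the orchestrator switches to the matching receiver, which avoids the deadlock noted above when the old channel is full.

use std::sync::Arc;
use tokio::sync::mpsc;

/// Stand-in for the real `L1Notification` type.
#[derive(Debug)]
pub enum L1Notification {
    NewBlock(u64),
}

/// Command sent from the `ChainOrchestrator` to the `L1Watcher` (assumed shape).
pub enum L1WatcherCommand {
    /// Reset the watcher to `block_number` and switch notifications over to
    /// `new_notification_tx`, so a send blocked on the old, full channel cannot stall the reset.
    Reset { block_number: u64, new_notification_tx: mpsc::Sender<Arc<L1Notification>> },
}

/// Handle held by the orchestrator to control the watcher.
#[derive(Clone)]
pub struct L1WatcherHandle {
    commands: mpsc::Sender<L1WatcherCommand>,
}

impl L1WatcherHandle {
    /// Ask the watcher to rewind to `block_number` and return a fresh notification receiver.
    pub async fn reset(&self, block_number: u64) -> mpsc::Receiver<Arc<L1Notification>> {
        let (tx, rx) = mpsc::channel(1000);
        let _ = self
            .commands
            .send(L1WatcherCommand::Reset { block_number, new_notification_tx: tx })
            .await;
        rx
    }
}

/// Skeleton of the watcher side: the command receiver is polled alongside regular indexing work.
pub async fn watcher_loop(
    mut commands: mpsc::Receiver<L1WatcherCommand>,
    mut notifications: mpsc::Sender<Arc<L1Notification>>,
) {
    let mut current_block = 0u64;
    loop {
        tokio::select! {
            cmd = commands.recv() => match cmd {
                Some(L1WatcherCommand::Reset { block_number, new_notification_tx }) => {
                    // Rewind and start sending on the replacement channel.
                    current_block = block_number;
                    notifications = new_notification_tx;
                }
                None => break,
            },
            // Placeholder for the real indexing work: fetch the next L1 block and notify.
            _ = tokio::time::sleep(std::time::Duration::from_secs(1)) => {
                current_block += 1;
                if notifications.send(Arc::new(L1Notification::NewBlock(current_block))).await.is_err() {
                    break;
                }
            }
        }
    }
}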

@codspeed-hq

codspeed-hq bot commented Oct 30, 2025

CodSpeed Performance Report

Merging #403 will not alter performance

Comparing feat/self-healing-l1-events (6a23c25) with main (e7ba7aa)

Summary

✅ 2 untouched

@jonastheis jonastheis marked this pull request as ready for review November 4, 2025 09:53
@jonastheis jonastheis requested a review from frisitano November 4, 2025 09:53
Collaborator

@frisitano frisitano left a comment

Added some comments inline.

// testing
#[cfg(feature = "test-utils")]
{
    let (tx, rx) = tokio::sync::mpsc::channel(1000);
Collaborator

Can we create an L1 watcher handle and receiver channel here that can be used for testing?
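
For illustration, something along these lines could expose both to tests (the L1WatcherHandle::new constructor and the command channel are assumptions here, not the crate's actual API):

// testing
#[cfg(feature = "test-utils")]
{
    // Sketch only: assumed constructor and command channel.
    let (notification_tx, notification_rx) = tokio::sync::mpsc::channel(1000);
    let (command_tx, command_rx) = tokio::sync::mpsc::channel(16);
    let l1_watcher_handle = L1WatcherHandle::new(command_tx);
    // Tests keep `l1_watcher_handle` and `notification_rx`; the watcher under
    // test keeps `notification_tx` and `command_rx`.
}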

async fn get_batch_by_index(
    &self,
    batch_index: u64,
    processed: Option<bool>,
Collaborator

What's the purpose of adding the processed filter?

Contributor Author

Thought I needed it at some point. Reverted.

/// New sender to replace the current notification channel
new_sender: mpsc::Sender<Arc<L1Notification>>,
/// Oneshot sender to signal completion of the reset operation
response_sender: oneshot::Sender<()>,
Collaborator

Why is this needed?

Contributor Author

Not really needed. Removed.

/// This trait allows the chain orchestrator to send commands to the L1 watcher,
/// primarily for gap recovery scenarios.
#[async_trait::async_trait]
pub trait L1WatcherHandleTrait: Send + Sync + 'static {
Collaborator

What value does a trait add here as opposed to using a concrete type? Do we intend the handle to be generic in some way?

Contributor Author

Yeah, thought I'd need it for testing. Removed now.

Comment on lines 90 to 93
pub struct MockL1WatcherHandle {
    /// Track reset calls as (`block_number`, `channel_capacity`)
    resets: Arc<std::sync::Mutex<Vec<(u64, usize)>>>,
}
Collaborator

Why do we need this? Can't we just inspect the receiver channel directly? I think we would then be able to remove MockL1WatcherHandle and the L1WatcherHandleTrait and just use the L1WatcherHandle directly. I think this would result in simpler code.
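
For illustration, the kind of direct assertion this would allow in tests (the command enum and its reset fields are assumed here):

// Sketch only: inspect the command receiver directly instead of a mock.
match command_rx.try_recv() {
    Ok(L1WatcherCommand::Reset { block_number, .. }) => assert_eq!(block_number, expected_reset_block),
    _ => panic!("expected a reset command after the gap"),
}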

Contributor Author

Yeah, thought I'd need it for testing. Removed now.

/// A receiver for [`L1Notification`]s from the [`rollup_node_watcher::L1Watcher`].
l1_notification_rx: Receiver<Arc<L1Notification>>,
/// Handle to send commands to the L1 watcher (e.g., for gap recovery).
l1_watcher_handle: Option<H>,
Collaborator

Why is this optional?

Contributor Author

Removed the option.

Comment on lines 569 to 599
) {
    Err(ChainOrchestratorError::L1MessageQueueGap(queue_index)) => {
        // Query database for the L1 block of the last known L1 message
        let reset_block =
            self.database.get_last_l1_message_l1_block().await?.unwrap_or(0);
        // TODO: handle None case (no messages in DB)

        tracing::warn!(
            target: "scroll::chain_orchestrator",
            "L1 message queue gap detected at index {}, last known message at L1 block {}",
            queue_index,
            reset_block
        );

        // Trigger gap recovery
        self.trigger_gap_recovery(reset_block, "L1 message queue gap").await?;

        // Return no event, recovery will re-process
        Ok(None)
    }
    Err(ChainOrchestratorError::DuplicateL1Message(queue_index)) => {
        tracing::info!(
            target: "scroll::chain_orchestrator",
            "Duplicate L1 message detected at {:?}, skipping",
            queue_index
        );
        // Return no event, as the message has already been processed
        Ok(None)
    }
    result => result,
}
Collaborator

Why don't we embed this logic inside of handle_l1_message?

Comment on lines 532 to 562
match metered!(Task::BatchCommit, self, handle_batch_commit(batch.clone())) {
    Err(ChainOrchestratorError::BatchCommitGap(batch_index)) => {
        // Query database for the L1 block of the last known batch
        let reset_block =
            self.database.get_last_batch_commit_l1_block().await?.unwrap_or(0);
        // TODO: handle None case (no batches in DB)

        tracing::warn!(
            target: "scroll::chain_orchestrator",
            "Batch commit gap detected at index {}, last known batch at L1 block {}",
            batch_index,
            reset_block
        );

        // Trigger gap recovery
        self.trigger_gap_recovery(reset_block, "batch commit gap").await?;

        // Return no event, recovery will re-process
        Ok(None)
    }
    Err(ChainOrchestratorError::DuplicateBatchCommit(batch_info)) => {
        tracing::info!(
            target: "scroll::chain_orchestrator",
            "Duplicate batch commit detected at {:?}, skipping",
            batch_info
        );
        // Return no event, as the batch has already been processed
        Ok(None)
    }
    result => result,
}
Collaborator

Why don't we embed this logic in handle_batch_commit?

/// # Arguments
/// * `reset_block` - The L1 block number to reset to (last known good state)
/// * `gap_type` - Description of the gap type for logging
async fn trigger_gap_recovery(
Collaborator

If we embed the L1Notification channel inside the L1WatcherHandle, then we can implement this logic on the L1WatcherHandle directly, enabling better encapsulation.
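
For illustration, roughly the shape this suggests, reusing the assumed L1WatcherCommand and L1Notification placeholders from the sketch in the PR description (field and method names are assumptions, not the PR's final code):

/// Sketch only: the handle owns the command sender and the current notification
/// receiver, so gap recovery can swap the channel internally.
pub struct L1WatcherHandle {
    commands: mpsc::Sender<L1WatcherCommand>,
    l1_notification_rx: mpsc::Receiver<Arc<L1Notification>>,
}

impl L1WatcherHandle {
    /// Reset the watcher to `reset_block` and replace the notification channel with a
    /// fresh one, dropping the old (possibly full) receiver.
    pub async fn trigger_gap_recovery(&mut self, reset_block: u64, gap_type: &str) {
        tracing::warn!(target: "scroll::watcher", "triggering gap recovery ({gap_type}) from L1 block {reset_block}");
        let (tx, rx) = mpsc::channel(1000);
        let _ = self
            .commands
            .send(L1WatcherCommand::Reset { block_number: reset_block, new_notification_tx: tx })
            .await;
        self.l1_notification_rx = rx;
    }

    /// The orchestrator keeps reading notifications through the handle.
    pub async fn next_notification(&mut self) -> Option<Arc<L1Notification>> {
        self.l1_notification_rx.recv().await
    }
}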

Contributor Author

Done in dce07df
